Abstract



The dialects of Madagascar belong to the Greater Barito East group of the Austronesian family, and it
is widely accepted that the Island was colonized by Indonesian sailors after a maritime trek which probably
took place around 650 CE. The language most closely related to the Malagasy dialects is Maanyan, but
Malay is also strongly related, especially as concerns navigation terms. Since the Maanyan Dayaks live
along the Barito river in Kalimantan (Borneo) and do not possess the necessary skills for long maritime
navigation, they were probably brought as subordinates by Malay sailors.
In a recent paper we compared 23 different Malagasy dialects in order to determine the time and the
landing area of the first colonization. In this research we use new data and new methods to confirm that
the landing took place on the south-east coast of the Island. Furthermore, we are able to state here that it
is unlikely that there were multiple settlements and, therefore, that colonization consisted of a single
founding event.

To reach our goal we determine the internal kinship relations among all the 23 Malagasy dialects, and we
also determine the degrees of kinship of the 23 dialects with Malay and Maanyan. The method used
is an automated version of the lexicostatistic approach. The data concerning Madagascar were collected by
the author at the beginning of 2010 and consist of Swadesh lists of 200 items for 23 dialects covering all
areas of the Island. The lists for Maanyan and Malay were obtained from published datasets supplemented by
the author's interviews.



Keywords: Malagasy dialects, Austronesian languages, taxonomy of languages, lexicostatistics, Malagasy
origins.



Introduction

The Malagasy language (as well as all its dialects) belongs to the Austronesian linguistic family. This was
definitively established in [1], which also shows a particularly close relationship between Malagasy and
Maanyan, a language spoken by a Dayak community of Borneo. A relevant contribution also comes from loanwords
from other Indonesian languages such as Ngaju Dayak, Buginese, Javanese and Malay [2, 3]. In particular, Malay
is very well represented in the domain of navigation terms. A very small part of the vocabulary can be
associated with non-Austronesian languages (for example, Bantu languages as concerns faunal names [3]).

The Indonesian colonizers reached Madagascar by a maritime trek at a time that we estimated in a recent
paper [5] to be around 650 CE, a date which is within the widely accepted range [2, 3]. In the same paper
we found a strong indication that the landing area was in the south-east of the Island. This was established
by assuming that the homeland is the area exhibiting the maximum of current linguistic diversity. Diversity
was measured by comparing lexical and geographical distances.

In this paper we confirm the south-east location as the area of landing (where the population dispersal
originated). Furthermore, we find that colonization consisted of a single founding event. Therefore, it is
unlikely that there were multiple settlements, and any subsequent landings did not significantly alter the
linguistic equilibrium. Our study starts from the consideration that Maanyan speakers, who live along the
rivers of Kalimantan, do not have the necessary skills for long-distance maritime navigation. The most
reasonable explanation [2, 3] is that they were brought as subordinates by Malay sailors. For this reason we
reexamine the internal kinship relations among all the 23 Malagasy dialects, but we also compare all these
variants with both Malay and Maanyan. These new results concerning the Malagasy dialects and their
relations with the two Indonesian languages are examined with new methods, which all confirm that the landing
took place on the south-east coast of the Island.

The vocabulary used for the present study was collected by the author with the invaluable help of Joselina
Soafara Nere at the beginning of 2010. The dataset, which can be found in [6], consists of 200-word Swadesh
lists [7] for 23 dialects of Malagasy from all the areas of the Island. The orthographical conventions are those
of standard Malagasy. Most of the informants were able to write the words directly using these conventions,
while a few of them benefited from the help of one or more fellow townsmen. For each dialect list two different
speakers were interviewed; the complete list of speakers is provided in Appendix B, while the locations can be
seen in Fig. 2. Finally, the lists for Maanyan and Malay were obtained from a published dataset [8]
supplemented by the author's interviews.

Method 

The method that we use [5, 10] is based on a lexical comparison of languages by means of an automated measure
of distance between pairs of words with the same meaning contained in their Swadesh lists. Swadesh
lists [7] have been popular in lexicostatistics for half a century. They are lists of words associated with the
same M meanings (the original Swadesh choice was M = 200), which concern the basic activities of humans. By
comparing the two lists corresponding to a pair of languages it is possible to determine the percentage of
shared cognates, which is a measure of their lexical distance. Recent examples of the use of Swadesh lists and
cognate counting to construct language trees are the studies of Gray and Atkinson [11] and Gray and Jordan [12].

The idea of measuring relationships among languages using vocabulary is much older than lexicostatistics
and it seems to have its roots in the work of the French explorer Dumont D'Urville. He collected comparative
word lists during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical
division of the Pacific [13], he proposed a method to measure the degree of relation among languages. He used
a core vocabulary of 115 terms, then he assigned a distance from 0 to 1 to any pair of words with the same
meaning, and finally he was able to determine the degree of relation between any pair of languages.

Our automated method (see Appendix A for details) works as follows: for any language we write down
a Swadesh list, then we compare words with the same meaning belonging to different languages, considering only
orthographic differences. This approach is motivated by the analogy with genetics: the vocabulary has the role
of DNA and the comparison is simply made by measuring the differences between the DNA of the two languages.
There are various advantages: the first is that, at variance with previous methods, it avoids subjectivity; the
second is that results can be replicated by other scholars, assuming that the database is the same; the third is
that no specific expertise in linguistics is required; and the last, but surely not the least, is that it allows
for a rapid comparison of a very large number of languages (or dialects).

If a family of languages is considered, all the information is encoded in a matrix whose entries are the pairwise
lexical distances. This information, however, is not manifest and has to be extracted. The most common
approach to this problem is to transform the matrix information into a phylogenetic tree.

In this transformation, however, part of the information may be lost because transfer among languages
is not exclusively vertical (as in mtDNA transmission from mother to child) but can also be horizontal
(borrowings and, in extreme cases, creolization). Another approach is the geometric one [14, 5] that results
from Structural Component Analysis (SCA), which we have recently proposed. This approach encodes the matrix
information into the positions of the languages in an n-dimensional space. For large n one recovers all the
matrix content, but a low dimensionality, typically n=2 or n=3, is sufficient to grasp all the relevant
information. The results in this paper mostly rely on a direct investigation of the entries of the matrix and
on simple averages over them.



Malagasy dialects 

The number of Malagasy dialects we consider is N=23; therefore, the output of our method, when applied only
to these variants, is a matrix with N(N-1)/2 = 253 non-trivial entries representing all the possible lexical
distances among dialects. This matrix is explicitly shown in Appendix A.

The information concerning the vertical transmission of vocabulary from proto-Malagasy to the
contemporary dialects can be extracted by a phylogenetic approach. There are various possible choices for the
algorithm for the reconstruction of the family tree (see [5] for a discussion of this point); we show in Fig. 1
the output of the Unweighted Pair Group Method Average (UPGMA). In this figure the name of the dialect is
followed by the name of the town where it was collected. The input data for the UPGMA tree are the pairwise
separation times obtained from the lexical distances by means of a simple logarithmic rule [9, 10]. The
absolute time-scale is calibrated by the results of the SCA analysis, which indicate a separation date of 650 CE [5].
The phylogenetic tree in Fig. 1 interestingly shows a main partition of Malagasy dialects into two main branches
(east-center-north and south-west), at variance with previous studies which gave a different partitioning [15]
(indeed, the results in [15] coincide with ours if a correct phylogeny is applied; see [5] for a discussion of this
point). Then each of the two branches splits, in turn, into two sub-branches whose leaves are associated with
different colors. In order to demonstrate the strict correspondence of this cladistics with the geography, we
display a map of Madagascar (Fig. 2) where the locations of the 23 dialects are indicated with the same colors
as the leaves in Fig. 1. We remark the relative isolation of the Antandroy variant (yellow).
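The clustering step can be illustrated with a minimal pure-Python UPGMA sketch. It is not the code used for Fig. 1: the function name `upgma`, the labels A to D and the toy matrix are hypothetical, and the sketch clusters the values it is given directly, without applying the logarithmic rule that converts lexical distances into separation times (that rule is not specified in this section).

```python
from itertools import combinations


def upgma(names, matrix):
    """Average-linkage (UPGMA) clustering of a symmetric distance matrix.

    Returns a nested tuple (left, right, height), where height is half the
    average distance between the two merged clusters.
    """
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                       # cluster id -> number of leaves
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)

    while len(trees) > 1:
        i, j = min(dist, key=dist.get)                  # closest pair of clusters
        height = dist[(i, j)] / 2
        merged_size = sizes[i] + sizes[j]

        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the distances from its two parts.
        new_dist = {}
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / merged_size

        # Drop every entry that involves the two merged clusters.
        dist = {pair: v for pair, v in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)

        trees[next_id] = (trees.pop(i), trees.pop(j), height)
        sizes[next_id] = merged_size
        next_id += 1

    return next(iter(trees.values()))


# Hypothetical toy input (not data from this paper).
toy_names = ["A", "B", "C", "D"]
toy_matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(toy_names, toy_matrix)
print(tree)
```

In the toy input A and B are merged first, C joins them next and D last, so the nesting of the returned tuple mirrors the branching order one would read off a tree such as Fig. 1.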

Up to now, we have only shown the consistency of the approach, which can be appreciated by comparing
Fig. 1 and Fig. 2. We start our investigation by computing the average distance of each of the dialects
from all the others (see Fig. 3). Antandroy has the largest average distance, confirming that it is the overall
most deviant variant (something which is also commonly pointed out by other Malagasy speakers). We further
note that the smallest average distances are those of Merina (the official language), Betsileo and Bara, which
are all spoken on the highlands. The fact that Merina has the smallest average distance is possibly partially
explained by the fact that this variant is the official one. However, as we will show later by means of a
comparison of Malagasy dialects with Malay and Maanyan, this cannot be the only explanation. More interestingly,
we remark that the Antambohoaka and Antaimoro variants, which are spoken in Mananjary and Manakara, also have a
very small average distance from the other dialects. Both these dialects are spoken on the south-east coast of
Madagascar in a relatively isolated position and, therefore, this is the first evidence for the south-east as
the homeland of the Malagasy language and, likely, as the location of the first settlement.
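The average distance plotted in Fig. 3 is simply a row average of the distance matrix with the diagonal excluded. A minimal sketch follows, assuming a symmetric matrix with entries multiplied by 1000 as in Appendix A; the labels and numbers are hypothetical placeholders, not the measured values.

```python
def average_distances(names, matrix):
    """Mean distance of each variety from all the others (diagonal excluded)."""
    n = len(names)
    return {
        names[i]: sum(matrix[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    }


# Hypothetical toy matrix (entries multiplied by 1000), not data from this paper.
toy_names = ["A", "B", "C", "D"]
toy_matrix = [
    [0, 200, 300, 500],
    [200, 0, 300, 600],
    [300, 300, 0, 700],
    [500, 600, 700, 0],
]

averages = average_distances(toy_names, toy_matrix)
most_central = min(averages, key=averages.get)   # smallest average distance
most_deviant = max(averages, key=averages.get)   # largest average distance
print(averages, most_central, most_deviant)
```

With the real 23 x 23 matrix, the variety returned as `most_deviant` would correspond to the role played by Antandroy in the text, and the smallest averages to the highlands and south-east coast dialects.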

The identification of the south-east coast of Madagascar as the landing area for the Indonesian colonizers is
supported by geographical considerations. In fact, there is an Indian Ocean current which goes from Sumatra to
Madagascar. When Mount Krakatoa erupted in 1883, pumice arrived on the south-east coast where the Mananjary
River opens into the sea. Furthermore, during the Second World War, the same area saw the arrival of pieces
of wreckage from ships sailing between Java and Sumatra that had been bombed by the Japanese air force.
Notice that the mouth of the Mananjary River is where the town of Mananjary is presently located, and it is
also close to Manakara. The Indonesian ancestors of today's Malagasy probably took advantage of this current,
which they possibly entered by sailing through the Sunda Strait.

Dialects, Malay and Maanyan 

The classification of Malagasy (and its dialects) among the Greater Barito East languages of Borneo, as well
as the particularly close relationship with Maanyan, is beyond doubt. However, Malagasy also underwent
influences from other Indonesian languages such as Ngaju Dayak, Javanese, Buginese and, particularly, Malay,
which exhibits the most relevant relationship after Maanyan.

If we consider the 23 dialects together with Malay and Maanyan, not only do we have to compute the 253
internal distances, but we also have to determine the 23x2=46 distances of each of the dialects from the two
Indonesian languages. These new distances are displayed in Fig. 4.

First of all we observe, as expected, that the largest of the distances from Maanyan is smaller than the
smallest of the distances from Malay. This simply reflects the fact that Malagasy is first of all an East Barito
language. We also observe that the Malagasy dialects seem to have almost the same relative composition. In
fact, all the points in Fig. 4 have almost the same ratio of distance from Malay to distance from Maanyan.
This is a strong indication that the linguistic makeup is substantially the same for all dialects and, therefore,
that they all originated from the same founding population, of which they reflect the initial composition. The
conclusion is that the founding event was likely a single one and subsequent immigration did not significantly
alter the linguistic composition.
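Both observations can be checked numerically once the 46 distances are available. The sketch below assumes each dialect is stored as a pair (distance from Maanyan, distance from Malay); the variety names and values are hypothetical placeholders rather than the points of Fig. 4, and the relative spread of the ratio is only one possible way of quantifying "almost the same".

```python
from statistics import mean, pstdev

# Hypothetical placeholder points: (distance from Maanyan, distance from Malay).
toy_points = {
    "V1": (0.60, 0.75),
    "V2": (0.64, 0.80),
    "V3": (0.66, 0.858),
}

# First observation: every distance from Maanyan is below every distance from Malay.
largest_maanyan = max(d_maanyan for d_maanyan, _ in toy_points.values())
smallest_malay = min(d_malay for _, d_malay in toy_points.values())
maanyan_always_closer = largest_maanyan < smallest_malay

# Second observation: the Malay / Maanyan ratio is nearly the same for all varieties.
ratios = {name: d_malay / d_maanyan for name, (d_maanyan, d_malay) in toy_points.items()}
relative_spread = pstdev(ratios.values()) / mean(ratios.values())

print(maanyan_always_closer, ratios, relative_spread)
```

A small relative spread corresponds to the points of Fig. 4 lying close to a single straight line through the origin.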

Indeed, looking more carefully, one can detect a little less Malay in the north, since red circles have a
larger ratio than all the others. This cannot be a consequence of a larger African influence on the
vocabulary due to the active trade with the continent and the Comoros Islands: in that case both the Maanyan
and Malay components of the vocabulary would be affected. Instead, this may be the effect of Malay trading which,
according to Adelaar [2, 3], continued for several centuries after colonization.

Notably, some dialects changed less with respect to the proto-language (Antananarivo, Fianarantsoa,
Mananjary, Manakara); in fact, their distances from both Maanyan and Malay are smaller than those of the
other dialects. This is probably the most relevant phenomenon, and we underline that the variants which are
less distant on average from the other dialects (Fig. 3) are also less distant from Malay
and Maanyan (Fig. 4). Therefore, the fact that Merina is closer to the other dialects cannot be explained merely
by the fact that it is the official variant.

We have checked whether the picture which emerges from Fig. 4 is confirmed by a comparison with other
related Indonesian languages. The result is positive, and in particular the dialects of Mananjary, Manakara,
Antananarivo and Fianarantsoa seem to be closer to most of the Indonesian languages with which we compare
them. Note that Mananjary and Manakara are both in the previously identified landing area on the south-east
coast, while Antananarivo and Fianarantsoa are in the central highlands of the Island. This suggests a scenario
in which there was a migration to the highlands of Madagascar (Betsileo and Imerina regions) shortly
after the landing on the south-east coast (Manakara, Mananjary).

In conclusion, both the average distances in Fig. 3 and the distances from related Indonesian languages point to
the south-east coast as the area of the first settlement. This is the same indication which comes from the fact
that linguistic diversity is higher in that region (see [5]).

Finally, we remark that the Antandroy variant (Ambovombe) is the most distant from Maanyan and among
the most distant dialects from Malay, again proving to be the most deviant dialect. It is not clear whether its
divergent evolution was due to internal factors or to specific language contacts which are still to be identified.

Outlook 

The main open problem concerning Malagasy is to determine the composition of the population which settled
the Island. Adelaar writes: "Malay influence persisted for several centuries after the migration. But, except for
this Malay influence, most influence on Malagasy from other Indonesian languages seems to be pre-migratory.
(...) I also believe it possible that the early migrants from south-east Asia came not exclusively from the
south-east Barito area, in fact, that south-east Barito speakers may not even have constituted a majority among
these migrants, but rather formed a nuclear group which was later reinforced by south-east Asian migrants with a
possibly different linguistic and cultural background (and, of course, by African migrants). Whatever view one
may hold on how the early Malagasy were influenced by other Indonesians, it seems necessary that we at least
develop a more cosmopolitan view on the Indonesian origins of the Malagasy. A south-east Barito origin is
beyond dispute, but this is of course only one aspect of what Malagasy dialects and cultures reflect today. Later
influences were manifold, and some of these influences, African as well as Indonesian, were so strong that they
have molded the Malagasy language and culture in all its variety into something new, something for the analysis
of which a south-east Barito origin has become a factor of little explanatory value."

In order to clarify the problem raised by Adelaar, it is necessary to understand the relationships of Malagasy
with other Indonesian languages (and possibly African ones). The fact that the use of some words is limited
to one or more dialects was already taken into account in previous studies. For example, it is known that the
word alika, which refers to dog in Merina (the official variant), is replaced by the word amboa, of African origin,
in most dialects. Nevertheless, the study of Malagasy dialects in comparison with Indonesian languages is a
still largely unexplored field of research. Each dialect may provide pieces of information about the history of
the language, eventually allowing us to track the various linguistic influences experienced by Malagasy since the
initial colonization of the Island.

Another open problem concerns the pre-Indonesian ancestral population. It is still debated whether the
island was inhabited before the Indonesian colonization. If the answer is positive, it may be possible
to track the aboriginal vocabulary in the dialects. For example, the Mikea are the only hunter-gatherers in 
Madagascar, and it is unclear whether they are a relic of the aboriginal pre-Indonesian population or just 
'ordinary' Malagasy who switched to a simpler economy for historical reasons. If the first hypothesis is the 
correct one, they should show some residual aboriginal vocabulary in their dialect, and the same is expected 
for the neighboring populations of Vezo and Masikoro. 
Appendix A

We start with our definition of the lexical distance between two words, which is a variant of the Levenshtein
distance [16]. The Levenshtein distance is simply the minimum number of insertions, deletions, or substitutions
of a single character needed to transform one word into the other. Our distance is obtained by a normalization.
More precisely, given two words $\omega_1$ and $\omega_2$, their distance $d(\omega_1, \omega_2)$ is defined as

$$d(\omega_1, \omega_2) = \frac{d_L(\omega_1, \omega_2)}{l(\omega_1, \omega_2)} \qquad (1)$$

where $d_L(\omega_1, \omega_2)$ is their standard Levenshtein distance and $l(\omega_1, \omega_2)$ is the number
of characters of the longer of the two words $\omega_1$ and $\omega_2$. Therefore, the distance can take any
value between 0 and 1.

The reason for the normalization can be understood from the following example. Consider the case of two
words with the same length in which a single substitution transforms one word into the other. If they are short,
let's say 2 characters, they are very different. On the contrary, if they are long, let's say 8 characters, it is
reasonable to say that they are very similar. Without normalization, their distance would be the same, equal
to 1, regardless of their length. Instead, introducing the normalization factor, in the first case the distance is
1/2 whereas in the second it is much smaller and equal to 1/8.
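The definition in equation (1) and the example above can be reproduced with a short Python sketch. The function names `levenshtein` and `word_distance` are illustrative choices, the test strings are arbitrary letter sequences rather than Swadesh entries, and returning 0 for two empty strings is an assumption, since the text does not cover that case.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def word_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer word, as in (1)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings are treated as identical
    return levenshtein(a, b) / longest


# One substitution in a 2-character word and in an 8-character word.
print(word_distance("ab", "ac"))              # 0.5
print(word_distance("abcdefgh", "abcdefgx"))  # 0.125
```

Both pairs differ by exactly one substitution, so the unnormalized distance is 1 in each case, while the normalized values differ by a factor of four.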

We use the distance between pairs of words, as defined above, to construct the lexical distances of languages.
For any language we prepare a list of words associated with the same M meanings (we adopt the original Swadesh
choice of M = 200).

Assume that the number of languages is N and that any language in the group is labeled by a Greek letter (say
$\alpha$) and any word of that language by $\alpha_i$ with $1 \le i \le M$. The same index $i$ corresponds to the
same meaning in all languages, i.e., two words $\alpha_i$ and $\beta_j$ in the languages $\alpha$ and $\beta$
have the same meaning if $i = j$.

The lexical distance between two languages is then defined as

$$D(\alpha, \beta) = \frac{1}{M} \sum_{i=1}^{M} d(\alpha_i, \beta_i) \qquad (2)$$

It can be seen that $D(\alpha, \beta)$ is always in the interval [0,1] and obviously $D(\alpha, \alpha) = 0$.

The result of the analysis described above is an $N \times N$ upper triangular matrix whose entries are the
$N(N-1)/2$ non-trivial lexical distances $D(\alpha, \beta)$ between all pairs of languages.
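Equation (2) and the construction of the matrix can be sketched as follows. This is a minimal illustration rather than the program used for the paper: the function names are illustrative, the three "languages" X, Y, Z and their two-item lists are arbitrary strings (not Swadesh data), and the matrix is stored as a dictionary keyed by unordered pairs, which is one possible representation of the upper triangle.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def word_distance(a: str, b: str) -> float:
    """Normalized word distance of equation (1)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(list_a, list_b) -> float:
    """Average word distance over the M shared meanings, as in equation (2)."""
    if len(list_a) != len(list_b):
        raise ValueError("both lists must cover the same M meanings")
    return sum(word_distance(x, y) for x, y in zip(list_a, list_b)) / len(list_a)


def distance_matrix(lists):
    """Upper-triangular matrix as a dict: (name_a, name_b) -> D(a, b)."""
    return {
        (a, b): language_distance(lists[a], lists[b])
        for a, b in combinations(lists, 2)
    }


# Arbitrary placeholder lists with M = 2 meanings (not real vocabulary).
toy_lists = {
    "X": ["abcd", "efgh"],
    "Y": ["abcx", "efgh"],
    "Z": ["wvyz", "efgh"],
}
toy_matrix = distance_matrix(toy_lists)
print(toy_matrix)

# Number of non-trivial entries for N = 23 varieties.
n_pairs = len(list(combinations(range(23), 2)))
print(n_pairs)  # 253
```

With N = 3 toy languages the dictionary has 3 entries; with the 23 dialects it would have the 253 entries mentioned in the main text.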

The matrix of the 23 Malagasy dialects, with entries multiplied by 1000, is the following: 
where the number-variant correspondence is:

1 Antambohoaka (Mananjary), 2 Antaisaka (Vangaindrano), 3 Antaimoro (Manakara), 4 Zafisoro (Farafangana),
5 Bara (Betroka), 6 Betsileo (Fianarantsoa), 7 Vezo (Toliara), 8 Sihanaka (Ambatondrazaka), 9 Tsimihety
(Mandritsara), 10 Mahafaly (Ampanihy), 11 Merina (Antananarivo), 12 Sakalava (Morondava), 13 Betsimisaraka
(Fenoarivo-Est), 14 Antanosy (Tolagnaro), 15 Antandroy (Ambovombe), 16 Antankarana (Vohemar), 17
Masikoro (Miary), 18 Antankarana (Antalaha), 19 Sakalava (Ambanja), 20 Sakalava (Majunga), 21 Sakalava
(Maintirano), 22 Betsimisaraka (Mahanoro), 23 Antankarana (Ambilobe).

Appendix B 



Below we provide information on the people who furnished the data collected by the author at the beginning of
2010 with the invaluable help of Joselina Soafara Nere. For each dialect two consultants were independently
interviewed. Their names and birth dates follow each of the dialect names.



Table 1: People who furnished the data on Malagasy dialects

MERINA (ANTANANARIVO)
    SERVA Maurizio

ANTANOSY (TOLAGNARO)
    SOAFARA Joselina Nere, 08 November 1987
    ETONO Imasinoro Lucia, 18 February 1982

BETSIMISARAKA (FENOARIVO-EST)
    ANDREA Chanchette Geneviane, 07 August 1985
    RAZAKAMAHEFA Joachim Julien, 09 November 1977

SAKALAVA (MORONDAVA)
    SEBASTIEN Doret, 26 November 1980
    RATSIMANAVAKY Christelle J., 29 February 1984

VEZO (TOLIARA)
    RAKOTONDRABE Justin, 02 August 1972
    RASOAVAVATIANA Claudia S., 28 June 1983

ZAFISORO (FARAFANGANA)
    RALAMBO Alison, 11 June 1982
    RAZANAMALALA Jeanine, 03 February 1980

ANTAIMORO (MANAKARA)
    RAZAFENDRALAMBO Haingotiana, 24 July 1985
    RANDRIAMITSANGANA Blaise, 05 February 1989

ANTAISAKA (VANGAINDRANO)
    RAMAHATOKITSARA Fidel Justin, 24 April 1984
    FARATIANA Marie Luise, 17 August 1990

ANTAMBOHOAKA (MANANJARY)
    RAKOTOMANANA Roger, 04 May 1979
    ZAFISOA Raly, 20 April 1983

BETSILEO (FIANARANTSOA)
    RAMAMONJISOA Andrininina Leon Fidelis, 16 April 1987
    RAKOTOZAFY Teza, 25 December 1985

BARA (BETROKA)
    RANDRIANTENAINA Hery Oskar Jean, 17 January 1986
    NATHANOEL Fife Luther, 26 May 1983

TSIMIHETY (MANDRITSARA)
    RAEZAKA Francis, 23 December 1984
    FRANCINE Germaine Sylvia, 04 May 1985

MAHAFALY (AMPANIHY)
    VELONJARA Larissa, 21 April 1989
    NOMENDRAZAKA Christian, 07 June 1982

SIHANAKA (AMBATONDRAZAKA)
    ARINAIVO Robert Andry, 06 January 1979
    RONDRONIAINA Natacha, 27 December 1985

ANTANKARANA (VOHEMAR)
    ANDRIANANTENAINA N. Benoit, 06 August 1984
    EDVINA Paulette, 28 January 1982

ANTANKARANA (ANTALAHA)
    RANDRIANARIVELO Jean Ives, 24 December 1986
    RAZANAMIHARY Saia, 07 September 1985

SAKALAVA (AMBANJA)
    CASIMIR Jaozara Pacific, 03 April 1983
    ZAKAVOLA M. Sandra, 17 July 1984

SAKALAVA (MAJUNGA)
    RATSIMBAZAFY Serge, 17 May 1978
    VAVINIRINA Fideline, 23 June 1970

ANTANDROY (AMBOVOMBE)
    RASAMIMANANA Z. Epaminodas, 05 June 1983
    MALALATAHINA Tiaray Samiarivola, 07 July 1984

MASIKORO (MIARY)
    MAHATSANGA Fitahia, 22 March 1976
    VOANGHY Sidonie Antoinnette, 12 October 1981

ANTANKARANA (AMBILOBE)
    BAOHITA Maianne, 21 August 1984
    NOMENJANA HARY Jean Pierre Felix, 07 June 1980

SAKALAVA (MAINTIRANO)
    HANTASOA Marie Edvige, 02 November 1985
    KOTOVAO Bernard, 06 October 1983

BETSIMISARAKA (MAHANORO)
    RASOLONANDRASANA Voahirana, 24 September 1985
    ANDRIANANDRASANA Maurice, 03 April 1979

Figure 1: Phylogenetic tree of 23 Malagasy dialects obtained by the Unweighted Pair Group Method Average
(UPGMA). In this figure the name of the dialect is followed by the name of the town where it was collected. The
phylogenetic tree shows a main partition of the Malagasy dialects into four main groups associated with different
colors. The strict correspondence of this cladistics with the geography can be appreciated by comparison with
Fig. 2.

Figure 2: Geography of Malagasy dialects. The locations of the 23 dialects are indicated with the same colors
as in Fig. 1. Each dialect is identified by the name of the town where it was collected.

Figure 3: Average distance of each of the Malagasy dialects from all the others. The 23 dialects are colored as in
Fig. 2. Highlands dialects (Antananarivo, Fianarantsoa and Betroka) together with south-east coast dialects
(Mananjary and Manakara) show the smallest average distance.

Figure 4: Lexical distances of Malagasy dialects from Malay and Maanyan. The 23 dialects are indicated
with the same colors as in the map in Fig. 2. Highlands dialects together with the south-east coast dialects of
Mananjary and Manakara show the smallest distance from both Indonesian languages.